Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

Authors

Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

Abstract

Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as the go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain to which extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR (a setup with no relevance judgments for IR-specific fine-tuning) pretrained multilingual encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, peak performance is not met using the general-purpose multilingual text encoders ‘off-the-shelf’, but rather by relying on their variants that have been further specialized for sentence understanding tasks.
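
To make the setup described above concrete, here is a minimal sketch of unsupervised sentence-level CLIR with a pretrained multilingual sentence encoder: queries and documents in different languages are embedded into a shared space and ranked by cosine similarity, with no relevance judgments and no IR-specific fine-tuning. This is only an illustration under assumptions, not the paper's exact pipeline; it assumes the sentence-transformers library, and the checkpoint paraphrase-multilingual-MiniLM-L12-v2 is just one possible multilingual encoder, not necessarily one evaluated in the study.

```python
# Minimal sketch: unsupervised cross-lingual retrieval by embedding queries and
# documents with a multilingual sentence encoder and ranking by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint choice; any multilingual sentence encoder could be used.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

queries = ["Welche Stadt ist die Hauptstadt von Frankreich?"]  # German query
documents = [                                                  # English collection
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower was completed in 1889.",
    "Berlin is the capital of Germany.",
]

# L2-normalized embeddings, so a plain dot product equals cosine similarity.
q_emb = encoder.encode(queries, normalize_embeddings=True)
d_emb = encoder.encode(documents, normalize_embeddings=True)

scores = q_emb @ d_emb.T  # shape: (num_queries, num_documents)
for qi, query in enumerate(queries):
    print(query)
    for rank, di in enumerate(np.argsort(-scores[qi]), start=1):
        print(f"  {rank}. ({scores[qi, di]:.3f}) {documents[di]}")
```

The same ranking scheme carries over to document-level CLIR by encoding full documents or aggregating sentence embeddings, which is the regime where, according to the abstract, such encoders do not clearly outperform CLWE-based models.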

Similar Articles

A multilingual text mining approach to web cross-lingual text retrieval

To enable concept-based cross-lingual text retrieval (CLTR) using multilingual text mining, our approach will first discover the multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term relationships, in turn, are used to discover the conceptual content of the multilingual text, which is either a document contai...

Experiences in evaluating multilingual and text-image information retrieval

One important step during the development of information retrieval (IR) processes is the evaluation of the output regarding the information needs of the user. The “high quality” of the output is related to the integration of different methods to be applied in the IR process and the information included in the retrieved documents, but how can “quality” be measured? Although some of these methods...

Cross-lingual thesaurus for multilingual knowledge management

The Web is a universal repository of human knowledge and culture which has allowed unprecedented sharing of ideas and information on a scale never seen before. It can also be considered as a universal digital library interconnecting digital libraries in multiple domains and languages. Besides the advance of information technology, the global economy has also accelerated the development of inter-...

Multilingual thesauri in cross-language text and speech retrieval

This paper sets forth a framework for the use of thesauri as knowledge bases in cross-language retrieval. It provides a general introduction to thesaurus functions, structure, and construction with particular attention to the problems of multilingual thesauri. A thesaurus is a structure that manages the complexities of terminology in language and provides conceptual relati...

Unsupervised Cross-Lingual Lexical Substitution

Cross-Lingual Lexical Substitution (CLLS) is the task of providing, for a target word in context, several alternative substitute words in another language. The proposed sets of translations may come from external resources or be extracted from textual data. In this paper, we apply for the first time an unsupervised cross-lingual WSD method to this task. The method exploits the results ...

Journal

Journal: Lecture Notes in Computer Science

Year: 2021

ISSN: 0302-9743, 1611-3349

DOI: https://doi.org/10.1007/978-3-030-72113-8_23